Test Driven Development of Scientific Models 

SEA Conference - April 7-11, 2014 
Boulder, CO 


Tom Clune 


Advanced Software Technology Group 
Computational and Information Sciences and Technology Office 
NASA Goddard Space Flight Center 


Tom Clune (ASTG) 


TDD 


1/37 



1. Motivations
2. Testing
   • Testing Frameworks
   • Test-driven Development (TDD)
3. What about numerical software?





Motivation 1: Fear/Stress 



[Cartoon: www.cartoonstock.com]





Motivation 2: Productivity

Change
• New feature
• Refactor

Verify
• Compiles?
• Executes?
• Looks ok?
• Really correct?

What is the latency of verification for large scientific models?

Some observations about human behavior:

• Risk of defects scales with magnitude of change per iteration
• Development time per iteration will be comparable to verification time


Conclusion: 

Productivity is a nonlinear function of the cost of verification! 

Motivation 3: The Limelight

Climate modeling has grown to be of extreme socioeconomic importance:

Pearce, Fred. “Top economist counts future cost of climate change.”
NewScientist. 30 October 2006.
http://www.newscientist.com/article/dn10405-top-economist-counts-future-cost-of-climate-change.html

Software management and testing have not kept pace

► Strong validation against data, but ... 

► Validation is a blunt tool for isolating issues in coupled systems 

► Validation cannot detect certain types of software defects: 

★ Those that are only exercised in rare/future regimes 

Test Harness - work in safety

Collection of tests that constrain system

• Detects unintended changes
• Localizes defects
• Improves developer confidence
• Decreases risk from change
• Inexpensive compared to application (ideally)

Do you write legacy code?

"The main thing that distinguishes legacy code from non-legacy code is
tests, or rather a lack of tests."

Michael Feathers
Working Effectively with Legacy Code

• Lack of tests leads to fear of introducing subtle bugs and/or changing things inadvertently.
• It is also a barrier to involving pure software engineers in the development of our models.

Excuses, excuses ...

• Takes too much time to write tests
• Too difficult to maintain tests
• It takes too long to run the tests
• It is not my job
• Don't know correct behavior

http://java.dzone.com/articles/unit-test-excuses
- James Sugrue

• Numeric/scientific code cannot be tested, because ...






Desirable attributes for tests:

• Narrow/specific
  ► Failure of a test localizes defect to small section of code.
• Orthogonal to other tests
  ► Each defect causes failure in one or only a few tests.
• Complete
  ► All functionality is covered by at least one test.
  ► Any defect is detectable.
• Independent - No side effects
  ► No STDOUT; temp files deleted; ...
  ► Order of tests has no consequence.
  ► Failing test does not terminate execution.
• Frugal
  ► Execute quickly (think 1 millisecond)
  ► Small memory, etc.
• Automated and repeatable
• Clear intent
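
A minimal sketch of tests written with these attributes in mind, using Python's standard `unittest` module purely for illustration (the talk itself targets Fortran and pFUnit). The routine `write_restart` and the file name are hypothetical stand-ins, not part of any real model.

```python
import os
import tempfile
import unittest


def write_restart(path, values):
    """Hypothetical routine under test: write one value per line."""
    with open(path, "w") as f:
        for v in values:
            f.write(f"{v}\n")


class TestWriteRestart(unittest.TestCase):
    def setUp(self):
        # Independent: every test gets a fresh scratch directory,
        # so the order of tests has no consequence.
        self.tmp = tempfile.TemporaryDirectory()
        self.path = os.path.join(self.tmp.name, "restart.txt")

    def tearDown(self):
        # No side effects: temp files are deleted even if the test fails.
        self.tmp.cleanup()

    def test_writes_one_line_per_value(self):
        # Narrow, frugal, clear intent: one behavior, tiny input.
        write_restart(self.path, [1.0, 2.0])
        with open(self.path) as f:
            self.assertEqual(f.read().splitlines(), ["1.0", "2.0"])

    def test_empty_input_gives_empty_file(self):
        write_restart(self.path, [])
        self.assertEqual(os.path.getsize(self.path), 0)


def run(case):
    """Run one TestCase class; a failing test does not stop the others."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    return unittest.TextTestRunner(verbosity=0).run(suite)


result = run(TestWriteRestart)
```

The runner reports to STDERR rather than STDOUT, and each test touches only its own scratch directory.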

Testing Frameworks

[Diagram: Framework Driver, User Tests, Framework Services, Application; the driver runs the tests and produces a report]

Report:
  1271 tests run
  2 Failures

Testing Frameworks

• Key services
  ► Provide methods to succinctly express expected values
      call assertEqual(120, factorial(5))
  ► Register test procedures with framework
  ► Execute test procedures, and summarize success/failure
• Generally specific/customized to programming language (xUnit)
  ► Java (JUnit)
  ► Python (pyUnit)
  ► C++ (cxxUnit, cppUnit)
  ► Fortran (FRUIT, FUNIT, pFUnit)
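
The three key services can be sketched with Python's standard `unittest` module (the xUnit member for Python). This is an illustration only; the helper name `run_and_summarize` is an assumption introduced here, and pFUnit's own interfaces differ.

```python
import io
import math
import unittest


class TestFactorial(unittest.TestCase):
    # Registration: methods named test_* on a TestCase subclass are
    # picked up by the framework's loader.
    def test_factorial_of_five(self):
        # Succinct statement of the expected value.
        self.assertEqual(120, math.factorial(5))

    def test_factorial_of_zero(self):
        self.assertEqual(1, math.factorial(0))


def run_and_summarize(case):
    """Execute the registered tests and return a one-line summary."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    # Send the framework's own report to a buffer; only the summary is kept.
    result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
    failed = len(result.failures) + len(result.errors)
    return f"{result.testsRun} tests run, {failed} failures"


summary = run_and_summarize(TestFactorial)
print(summary)
```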

Frameworks and IDEs

Frameworks are often integrated within IDEs for even greater ease of use.

Old paradigm:

• Tests written by separate team (black box testing)
• Tests written after implementation

Consequences:

• Testing schedule compressed for release
• Defects detected late in development ($$)

New paradigm - Test-driven development (TDD)

• Developers write the tests (white box testing)
• Tests written before production code
• Enabled by emergence of strong unit testing frameworks

The TDD cycle

[Diagram: a repeating cycle]

1. Extend Tests (focus on interface)
2. Fix/Extend (focus on algorithm)
3. Refactor
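
One pass through the cycle might look like the following sketch. The function `clamp` is a hypothetical example chosen for brevity, not something from the talk.

```python
import unittest


# Step 1 - extend tests (focus on interface): these tests are written
# before clamp() exists, so they pin down its name, arguments and result.
class TestClamp(unittest.TestCase):
    def test_value_inside_range_is_unchanged(self):
        self.assertEqual(clamp(0.5, 0.0, 1.0), 0.5)

    def test_value_below_range_returns_lower_bound(self):
        self.assertEqual(clamp(-2.0, 0.0, 1.0), 0.0)

    def test_value_above_range_returns_upper_bound(self):
        self.assertEqual(clamp(3.0, 0.0, 1.0), 1.0)


# Step 2 - fix/extend (focus on algorithm): the first version was the
# simplest code that made the tests pass:
#     if x < lo: return lo
#     if x > hi: return hi
#     return x

# Step 3 - refactor: tidy the code while the tests stay green.
def clamp(x, lo, hi):
    """Limit x to the closed interval [lo, hi]."""
    return max(lo, min(x, hi))


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestClamp)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Because the tests were fixed in step 1, the rewrite in step 3 can be verified in milliseconds.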

• High reliability (excellent test coverage)
• Always “ready-to-ship”
• Tests act as maintainable documentation
  ► Tests show real use case scenarios
  ► Tests are continuously exercised (TDD process)
• Reduced stress / improved confidence
• Improved productivity
• Predictable schedule
• High quality implementation?
  ► Emphasis on interfaces
  ► Testable code is cleaner code.

What about numerical software?

• Presence of numerical error (roundoff or truncation)
• Lack of known (nontrivial) solutions
• Irreducible complexity?
• Stability - issues that occur after long integrations
• Emergent properties of coupled systems (including stability)

Numerical error

Testing numerical algorithms requires an accurate estimate for tolerance:

• If too low, then test fails for uninteresting reasons.
• If too high, then the test has no teeth.

Unfortunately ...

• Error estimates are seldom available for complex algorithms
• Best case scenario is usually some asymptotic form with unknown leading coefficient!
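
The tolerance dilemma can be seen in a small sketch. The trapezoid rule, the integrand x^2, and the deliberately defective variant are assumptions chosen for illustration; they do not come from the talk.

```python
def trapezoid(f, a, b, n):
    """Composite trapezoid rule with n equal intervals."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return h * total


def trapezoid_buggy(f, a, b, n):
    """Deliberate defect: the endpoint values are not halved."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += f(a + i * h)
    return h * total


def passes(result, expected, tol):
    return abs(result - expected) <= tol


def square(x):
    return x * x


exact = 1.0 / 3.0  # integral of x^2 over [0, 1]
good = trapezoid(square, 0.0, 1.0, 100)
bad = trapezoid_buggy(square, 0.0, 1.0, 100)

# Truncation error of the correct code is about 1.7e-5, so:
too_low = passes(good, exact, 1e-10)      # fails for an uninteresting reason
reasonable = passes(good, exact, 1e-4)    # correct code passes
catches_bug = not passes(bad, exact, 1e-4)
no_teeth = passes(bad, exact, 1e-2)       # the defect (error ~5e-3) slips through
```

Choosing 1e-4 here relied on knowing the error term for this simple rule, which is rarely the case for complex algorithms.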

TDD techniques in presence of numerical error

Sources:

1. Approximation
2. Nonlinearity - e.g., small denominators
3. Composition and iteration

Mitigation strategies:

1. Approximation:
   ► Test the implementation not the math (i.e., duck)
   ► Often more appropriate as validation test
2. Nonlinearity - use tailored synthetic inputs:
   ► E.g., choose values to make denominators O(1)
3. Composition/iteration: test steps in isolation:
   ► Allows choice of tailored synthetic inputs at each step
   ► Test iteration logic not accumulation
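
A sketch of the "tailored synthetic inputs" idea for small denominators. The function `slope` is a hypothetical step under test; the inputs are chosen so that the denominator is O(1) and every intermediate value is exactly representable.

```python
def slope(f0, f1, x0, x1):
    """Hypothetical step under test: finite-difference slope."""
    return (f1 - f0) / (x1 - x0)


# Tailored inputs: the denominator is exactly 2.0, so the expected
# result is exact and no tolerance is needed.
tailored = slope(1.0, 4.0, 0.0, 2.0)

# Swapping the function values flips the sign, which would expose an
# argument-order defect.
swapped = slope(4.0, 1.0, 0.0, 2.0)

# Untailored inputs: a tiny denominator amplifies roundoff. In exact
# arithmetic the answer is 1.0, but 1.0 + 1e-15 is not representable,
# so the computed value is off by roughly 10%.
amplified = slope(0.0, 1e-15, 1.0, 1.0 + 1e-15)
```

The last case exercises floating-point behavior rather than the implementation, which is why it makes a poor unit test.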

Example - testing layers in isolation

Consider the main loop of a climate model:

Do test

• Proper # of iterations
• Pieces called in correct order
• Passing of data between components

Do NOT test

• Calculations inside components

[Diagram: main loop of the model, ending with a Finalize step]

Easier with objects than with procedures.
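
A sketch of how the loop layer could be tested with stand-in objects. Everything here is hypothetical: `main_loop`, `RecordingComponent`, and the component names "atmosphere" and "ocean" are illustrative assumptions, not the structure of any real model.

```python
class RecordingComponent:
    """Test double: records calls instead of computing any physics."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def run(self, state):
        # Record who was called and at which step.
        self.log.append((self.name, state["step"]))
        # Mark the shared state so data passing can be verified.
        state["visited"].append(self.name)
        return state


def main_loop(components, n_steps):
    """Hypothetical driver under test: runs each component every step."""
    state = {"step": 0, "visited": []}
    for step in range(n_steps):
        state["step"] = step
        for component in components:
            state = component.run(state)
    return state


log = []
components = [RecordingComponent("atmosphere", log),
              RecordingComponent("ocean", log)]
final = main_loop(components, 2)
```

The log captures the number of iterations and the call order, and `final["visited"]` shows that the same state object was handed from one component to the next. No calculation inside a component is exercised.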

TDD without “known” solutions

Consider the apparent contradiction:

• Complex algorithms yield few nontrivial analytic solutions.
• Implementations are not random keystrokes

How can this be?

• Apparently analytic solutions are unnecessary!
• Algorithms are only sequences of steps

Tests should only verify translation, not validity of algorithms

• Test each step in isolation
• Tailor synthetic inputs to yield “obvious” results for each step
• Separately test that steps are composed correctly

But still use high level analytic solutions as tests when available!
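
A sketch of testing one step with "obvious" inputs, and the composition separately. The explicit periodic diffusion step and the names `diffuse_step` and `integrate` are assumptions made for illustration.

```python
def diffuse_step(u, alpha):
    """One explicit diffusion step on a periodic 1-D grid."""
    n = len(u)
    return [u[i] + alpha * (u[i - 1] - 2.0 * u[i] + u[(i + 1) % n])
            for i in range(n)]


def integrate(u, alpha, n_steps, step=diffuse_step):
    """Composition under test: apply `step` n_steps times."""
    for _ in range(n_steps):
        u = step(u, alpha)
    return u


# Obvious result 1: a uniform field is unchanged by diffusion.
uniform = diffuse_step([3.0] * 5, 0.25)

# Obvious result 2: a single spike spreads to its two neighbors with
# the stencil weights (alpha, 1 - 2*alpha, alpha).
spike = diffuse_step([0.0, 0.0, 4.0, 0.0, 0.0], 0.25)

# Composition: a fake step that just appends a marker shows the step
# is applied exactly n_steps times, with no numerics involved.
composed = integrate([], 7, 3, step=lambda u, a: u + [a])
```

None of these checks needs an analytic solution of the diffusion equation; each only confirms that the code is a faithful translation of the intended step.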

TDD and irreducible complexity

“Aren’t my tests as complex as the implementation?”

“Aren’t my tests just repeating logic in the implementation?”

• Short answer: No
• Long answer: Well, they shouldn’t be ...
  ► Unit tests use tailored inputs
  ► Implementation handles arbitrary values
  ► Models couple many components/algorithms => exponential complexity
  ► Tests are decoupled => linear complexity
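
One way to read the last two bullets is a simple counting model. This is an illustrative assumption, not something stated in the talk: suppose each of k components has m distinct cases worth checking.

```python
def coupled_cases(components, cases_per_component):
    """Combinations to cover when components are only tested together."""
    return cases_per_component ** components


def decoupled_cases(components, cases_per_component):
    """Cases to cover when each component is tested in isolation."""
    return cases_per_component * components


# Assumed sizes for illustration: 10 components, 3 cases each.
together = coupled_cases(10, 3)
isolated = decoupled_cases(10, 3)
print(together, isolated)
```

Under this model the coupled count grows as m^k while the decoupled count grows as m*k.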

TDD and emergent properties

• TDD generally does not directly address such issues
• If long integration gets bad results, (at least) one of the following must hold:
  1. Individual steps have defects => add unit tests
  2. Coupling/compositions have defects => add tests
  3. System lacks sufficient accuracy => increase accuracy
  4. Insufficient physical fidelity - science issue (testing is not magic)
• At the very least, TDD can reduce the frequency with which one must perform long integrations

TDD and performance

• TDD emphasizes small fine-grained implementations
• Such implementations are often sub-optimal in terms of performance
• Optimized implementations typically fuse multiple operations
• Solution: bootstrapping
  ► Use initial TDD solution as unit test for optimized implementation
  ► Maintain both implementations (and tests)
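
A sketch of bootstrapping, with hypothetical function names. The fine-grained version is built from small, separately testable steps; the fused version does the same work in one loop and is checked against the first.

```python
import math


def scale(values, factor):
    return [factor * v for v in values]


def shift(values, offset):
    return [v + offset for v in values]


def sum_of_squares(values):
    return sum(v * v for v in values)


def energy_reference(values, factor, offset):
    """Initial TDD solution: a composition of small tested steps."""
    return sum_of_squares(shift(scale(values, factor), offset))


def energy_fused(values, factor, offset):
    """Optimized version: one pass, no temporary lists."""
    total = 0.0
    for v in values:
        w = factor * v + offset
        total += w * w
    return total


def agrees_with_reference(values, factor, offset):
    """Unit test for the fused code, using the reference as the oracle."""
    expected = energy_reference(values, factor, offset)
    actual = energy_fused(values, factor, offset)
    return math.isclose(actual, expected, rel_tol=1e-12)


ok = agrees_with_reference([1.0, 2.0, 3.0], 2.0, 1.0)
```

A relative tolerance is used because a fused loop may legitimately round differently from the step-by-step version.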

TDD and the burden of legacy code

• TDD was created for developing new code, and does not directly speak to testing legacy code.
• Best practice for incorporating new functionality:
  ► Avoid wedging new logic directly into existing large procedure
  ► Use TDD to develop separate facility for new computation
  ► Just call the new procedure from the large legacy procedure
• Refactoring
  ► Use unit tests to constrain existing behavior
  ► Very difficult for large procedures
  ► Try to find small pieces to pull out into new procedures
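
A sketch of the "separate facility" practice. Both `clip_negative` and `legacy_update` are hypothetical; the legacy body is a short stand-in for what would be a large untested procedure.

```python
def clip_negative(values):
    """New facility, developed test-first: negative entries become 0.0."""
    return [v if v > 0.0 else 0.0 for v in values]


def legacy_update(state):
    """Stand-in for a large legacy procedure (body is hypothetical)."""
    # ... many lines of existing, untested logic would live here ...
    state["moisture"] = [m - 0.5 for m in state["moisture"]]
    # The only change to the legacy code is this single call.
    state["moisture"] = clip_negative(state["moisture"])
    return state


# Unit test of the new facility in isolation, with tailored inputs.
clipped = clip_negative([-1.0, 0.0, 2.5])

# A coarse test that constrains the legacy procedure's observable
# behavior on a small synthetic state.
updated = legacy_update({"moisture": [0.25, 1.0]})
```

The new logic is fully covered by its own fast tests, while the legacy procedure gains only one line.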

Summary

• TDD can be applied to scientific models
• Tool support exists (unabashed plug for pFUnit tutorial)
• Cost/benefit analysis for numerical software needs further study

Tom Clune
Thomas.L.Clune@nasa.gov
http://pfunit.sourceforge.net

Test-Driven Development: By Example - Kent Beck





